
Allow PermutedDimsArray in gemm_strided_batched #539

Merged: 17 commits into JuliaGPU:master from mcabbott:batchedmul191, Nov 17, 2020

Conversation

mcabbott
Contributor

This is copied from JuliaGPU/CuArrays.jl#664, which needed FluxML/NNlib.jl#191, now merged (but not yet tagged).

Summarising:

CUBLAS's gemm_strided_batched! accepts strides other than the ordered, column-major ones of a CuArray. This is potentially useful because it saves a permutedims before the operation, and the obvious way to expose it is to let the wrapper accept PermutedDimsArrays.
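For concreteness, a rough sketch of the intended use (the sizes and the (transA, transB, alpha, A, B, beta, C) argument order follow the existing wrapper; treat the exact call as illustrative):

```julia
using CUDA

A  = CUDA.rand(Float32, 5, 7, 16)      # batch of 16 matrices, each 5×7
Bp = CUDA.rand(Float32, 7, 16, 3)      # parent array with the batch dimension in the middle
B  = PermutedDimsArray(Bp, (1, 3, 2))  # lazily viewed as 7×3×16: same memory, permuted strides
C  = CUDA.zeros(Float32, 5, 3, 16)

# With this PR the wrapper accepts B directly, passing its strides to CUBLAS
# instead of requiring a permutedims copy first.
CUDA.CUBLAS.gemm_strided_batched!('N', 'N', 1f0, A, B, 0f0, C)
```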

I presume that by the time anyone is calling CuArrays.CUBLAS.gemm_strided_batched!, they are well aware that the operation is to be done by CUBLAS, and not expecting dispatch to redirect to the CPU if it's an Array. So I changed this to accept any AbstractArray.

Then NNlib.batched_mul will be a friendly function which dispatches every kind of array to the best routine. I hope I got this sorted out! To perform e.g. matrix * 3-tensor contractions like the table here, or batched matrix-vector products, the matrix (or vector) should be reshaped to another 3-array, with a size=1 dimension to indicate that it carries no batch index.
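A minimal sketch of that reshape trick, assuming the size-1 batch dimension broadcasts against the other argument's batch as described above (CPU arrays shown; CuArray inputs dispatch to the CUBLAS routine):

```julia
using NNlib

M = rand(Float32, 5, 7)        # an ordinary matrix, no batch index
X = rand(Float32, 7, 3, 16)    # a 3-tensor whose last dimension is the batch

# Reshape M to a 3-array with a trailing size-1 dimension, so batched_mul
# applies it to every slice of X:
Y = batched_mul(reshape(M, 5, 7, 1), X)  # size 5×3×16
```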

@mcabbott mcabbott changed the title allow PermutedDimsArray in gemm_strided_batched Allow PermutedDimsArray in gemm_strided_batched Nov 11, 2020
@maleadt
Member

maleadt commented Nov 12, 2020

I presume that by the time anyone is calling CuArrays.CUBLAS.gemm_strided_batched!, they are well aware that the operation is to be done by CUBLAS, and not expecting dispatch to redirect to the CPU if it's an Array. So I changed this to accept any AbstractArray.

It serves as documentation too. Looking at the name of the function, one might expect to be able to use it with actual strided inputs (i.e. a non-contiguous view), which the DenseCuArray type specification prevents.

Please rebase on latest master to get CI.

@mcabbott
Contributor Author

It's a pity to lose the documentation aspect, but is there another way? Inserting some huge union isn't going to result in a readable methods(). Perhaps I could insert one more step to _gemm_strided_batched! with a wider signature, which the NNlib function calls?
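A hypothetical illustration of the indirection being floated here (the _gemm_strided_batched! helper, its placement in lib/cublas/wrappers.jl, and the exact signatures are all part of the suggestion, not code from the PR):

```julia
# Public method keeps the narrow, self-documenting signature:
function gemm_strided_batched!(transA::Char, transB::Char, alpha::Number,
                               A::DenseCuArray{T,3}, B::DenseCuArray{T,3},
                               beta::Number, C::DenseCuArray{T,3}) where {T}
    _gemm_strided_batched!(transA, transB, alpha, A, B, beta, C)
end

# Internal method with the wide signature, which NNlib.batched_mul could call
# directly for wrappers such as PermutedDimsArray that CUBLAS can still consume:
function _gemm_strided_batched!(transA::Char, transB::Char, alpha::Number,
                                A::AbstractArray{T,3}, B::AbstractArray{T,3},
                                beta::Number, C::AbstractArray{T,3}) where {T}
    # the existing ccall-based implementation would move here unchanged
end
```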

Can rebase but I think CI won't work until NNlib is tagged.

@maleadt
Member

maleadt commented Nov 12, 2020

Can rebase but I think CI won't work until NNlib is tagged.

Oh, right. You can always add an explicit dep to the Manifest if you want to run CI before that happens.
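For example (a sketch of the usual way to record an unreleased dependency in the Manifest; the branch name is just NNlib's default):

```julia
using Pkg
# Equivalent to `pkg> add NNlib#master` in the REPL; records the git revision
# in Manifest.toml so CI installs the untagged version.
Pkg.add(PackageSpec(name="NNlib", rev="master"))
```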

Perhaps I could insert one more step to _gemm_strided_batched! with a wider signature, which the NNlib function calls?

Eh, maybe just add a comment at the end of the line.
Does this actually test the PermuteDims case? Otherwise it might be easy to accidentally refactor this part of the changes away.
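For concreteness, a minimal sketch of the kind of test being asked about (illustrative only, not the tests that were actually added; it assumes NNlib.batched_mul routes CuArray-backed inputs to gemm_strided_batched!):

```julia
using CUDA, NNlib, Test

A  = CUDA.rand(Float32, 5, 6, 8)
Bp = CUDA.rand(Float32, 6, 8, 4)
B  = PermutedDimsArray(Bp, (1, 3, 2))   # 6×4×8 view with permuted strides, no copy

# The GPU result should match the same computation done on the CPU.
@test Array(batched_mul(A, B)) ≈ batched_mul(Array(A), Array(B))
```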

@mcabbott
Contributor Author

mcabbott commented Nov 14, 2020

Good point about tests, I've added a few.

And put NNlib#master in the manifest to test on CI. (Documentation still fails, I presume because it's on a different manifest.)

This needs something to make pointers, possibly just Base.unsafe_convert(::Type{CuPtr{T}}, A::PermutedDimsArray) = .... Where should this live? I mean should Adapt or some other package somehow take care of this, instead of writing it directly?
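A minimal sketch of what that definition might look like (the PR's final version may differ; the point is that a PermutedDimsArray shares its parent's memory, so the pointer is just the parent's):

```julia
# The view starts at the same address as its parent; only the strides differ.
Base.unsafe_convert(::Type{CuPtr{T}}, A::PermutedDimsArray) where {T} =
    Base.unsafe_convert(CuPtr{T}, parent(A))
```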

@maleadt
Member

maleadt commented Nov 16, 2020

This needs something to make pointers, possibly just Base.unsafe_convert(::Type{CuPtr{T}}, A::PermutedDimsArray) = .... Where should this live? I mean should Adapt or some other package somehow take care of this, instead of writing it directly?

These conversions live in array.jl (CUDA.jl/src/array.jl, lines 370 to 490 at 1139ffd):

## views

# optimize view to return a CuArray when contiguous

struct Contiguous end
struct NonContiguous end

# NOTE: this covers more cases than the I<:... in Base.FastContiguousSubArray
CuIndexStyle() = Contiguous()
CuIndexStyle(I...) = NonContiguous()
CuIndexStyle(::Base.ScalarIndex...) = Contiguous()
CuIndexStyle(i1::Colon, ::Base.ScalarIndex...) = Contiguous()
CuIndexStyle(i1::AbstractUnitRange, ::Base.ScalarIndex...) = Contiguous()
CuIndexStyle(i1::Colon, I...) = CuIndexStyle(I...)

cuviewlength() = ()
@inline cuviewlength(::Real, I...) = cuviewlength(I...) # skip scalar
@inline cuviewlength(i1::AbstractUnitRange, I...) = (Base.unsafe_length(i1), cuviewlength(I...)...)
@inline cuviewlength(i1::AbstractUnitRange, ::Base.ScalarIndex...) = (Base.unsafe_length(i1),)

@inline function Base.view(A::CuArray, I::Vararg{Any,N}) where {N}
    J = to_indices(A, I)
    @boundscheck begin
        # Base's boundscheck accesses the indices, so make sure they reside on the CPU.
        # this is expensive, but it's a bounds check after all.
        J_cpu = map(j->adapt(Array, j), J)
        checkbounds(A, J_cpu...)
    end
    J_gpu = map(j->adapt(CuArray, j), J)
    unsafe_view(A, J_gpu, CuIndexStyle(I...))
end

@inline function unsafe_view(A, I, ::Contiguous)
    unsafe_contiguous_view(Base._maybe_reshape_parent(A, Base.index_ndims(I...)), I, cuviewlength(I...))
end

@inline function unsafe_contiguous_view(a::CuArray{T}, I::NTuple{N,Base.ViewIndex}, dims::NTuple{M,Integer}) where {T,N,M}
    offset = Base.compute_offset1(a, 1, I) * sizeof(T)

    alias(a.baseptr)
    b = CuArray{T,M}(a.baseptr, dims, a.ctx; offset=a.offset+offset)
    finalizer(unsafe_free!, b)
    b.state = ARRAY_MANAGED
    return b
end

@inline function unsafe_view(A, I, ::NonContiguous)
    Base.unsafe_view(Base._maybe_reshape_parent(A, Base.index_ndims(I...)), I...)
end

# pointer conversions

## contiguous
function Base.unsafe_convert(::Type{CuPtr{T}}, V::SubArray{T,N,P,<:Tuple{Vararg{Base.RangeIndex}}}) where {T,N,P}
    return Base.unsafe_convert(CuPtr{T}, parent(V)) +
           Base._memory_offset(V.parent, map(first, V.indices)...)
end

## reshaped
function Base.unsafe_convert(::Type{CuPtr{T}}, V::SubArray{T,N,P,<:Tuple{Vararg{Union{Base.RangeIndex,Base.ReshapedUnitRange}}}}) where {T,N,P}
    return Base.unsafe_convert(CuPtr{T}, parent(V)) +
           (Base.first_index(V)-1)*sizeof(T)
end


## reshape

# optimize reshape to return a CuArray
function Base.reshape(a::CuArray{T,M}, dims::NTuple{N,Int}) where {T,N,M}
    if prod(dims) != length(a)
        throw(DimensionMismatch("new dimensions $(dims) must be consistent with array size $(length(a))"))
    end

    if N == M && dims == size(a)
        return a
    end

    alias(a.baseptr)
    b = CuArray{T,N}(a.baseptr, dims, a.ctx; offset=a.offset)
    finalizer(unsafe_free!, b)
    b.state = ARRAY_MANAGED
    return b
end

# allow missing dimensions with Colon()
if VERSION < v"1.6.0-DEV.1358"
    Base.reshape(parent::CuArray, dims::Tuple{Vararg{Union{Int,Colon}}}) =
        Base.reshape(parent, Base._reshape_uncolon(parent, dims))
end


## reinterpret

# optimize reinterpret to return a CuArray
function Base.reinterpret(::Type{T}, a::CuArray{S,N}) where {T,S,N}
    if N == 0 && sizeof(T) != sizeof(S)
        throw(ArgumentError("cannot reinterpret a zero-dimensional `$(S)` array to `$(T)` which is of a different size"))
    end
    if N != 0 && sizeof(S) != sizeof(T)
        ax1 = axes(a)[1]
        dim = length(ax1)
        if Base.rem(dim*sizeof(S), sizeof(T)) != 0
            throw(ArgumentError("""
                cannot reinterpret an `$(S)` array to `$(T)` whose first dimension has size `$(dim)`.
                The resulting array would have non-integral first dimension.
                """))
        end
        if first(ax1) != 1
            throw(ArgumentError("cannot reinterpret a `$(S)` array to `$(T)` when the first axis is $ax1. Try reshaping first."))
        end
    end

    isize = size(a)
    size1 = div(isize[1]*sizeof(S), sizeof(T))
    osize = tuple(size1, Base.tail(isize)...)

    alias(a.baseptr)
    b = CuArray{T,N}(a.baseptr, osize, a.ctx; offset=a.offset)
    finalizer(unsafe_free!, b)
    b.state = ARRAY_MANAGED
    return b
end
We used to have them for ReshapedArray and ReinterpretArray too, but recently we've switched back to implementing those with CuArray again.

@mcabbott
Contributor Author

Great, I've moved the pointer conversion as you describe.

NNlib is now tagged, and I adjusted the bound in Project.toml. But CI doesn't seem to see v0.7.7, perhaps it needs more time.

@maleadt
Member

maleadt commented Nov 17, 2020

Sometimes the PkgServer lags General a bit. I restarted the build, and there's a normal failure in the NNlib tests now.

@codecov

codecov bot commented Nov 17, 2020

Codecov Report

Merging #539 (0be7a0e) into master (cd1191f) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #539   +/-   ##
=======================================
  Coverage   77.82%   77.83%           
=======================================
  Files         116      116           
  Lines        6423     6425    +2     
=======================================
+ Hits         4999     5001    +2     
  Misses       1424     1424           
Impacted Files Coverage Δ
lib/cublas/wrappers.jl 90.98% <100.00%> (+0.01%) ⬆️
src/array.jl 88.42% <100.00%> (+0.06%) ⬆️
src/nnlib.jl 14.28% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd1191f...0be7a0e. Read the comment docs.

@maleadt maleadt added the labels cuda array (Stuff about CuArray.) and enhancement (New feature or request) on Nov 17, 2020
@maleadt
Member

maleadt commented Nov 17, 2020

All green, let's merge this!

@maleadt maleadt merged commit d16a4b3 into JuliaGPU:master Nov 17, 2020
@mcabbott mcabbott deleted the batchedmul191 branch November 17, 2020 12:09
@mcabbott
Contributor Author

Thanks! And sorry about the excessive CI-ing, trying to finish this while away from my desktop maybe wasn't the brightest plan.

@maleadt
Member

maleadt commented Nov 17, 2020

No problem, that's what it's there for. But it prompted me to add a smoke test that has to succeed before starting 10 other jobs 😄
